Hector Corrada-Bravo and Rafael A. Irizarry
May 1, 2016
We have described two types of ML algorithms:
The first lacks flexibility.
The second does not do well with the curse of dimensionality.
or, if we want to include 10% of the data with two predictors, each window dimension needs to grow to \( \sqrt{.10} \approx 0.316 \):
In general, to include 10% of the data, each of the \( p \) dimensions needs to cover \( .10^{1/p} \) of its range.
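As a quick check of this formula, we can compute \( .10^{1/p} \) for a few values of \( p \) (the values of p below are chosen only for illustration):

```r
# side length of a window that captures 10% of uniformly
# distributed data in p dimensions: .10^(1/p)
p <- c(1, 2, 10, 100)
round(.10^(1/p), 3)
# 0.100 0.316 0.794 0.977
```

Even at \( p = 10 \), the window must span almost 80% of each dimension, which is why local smoothing breaks down in high dimensions.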

We use regression trees for continuous outcomes:
The important observation is that regression trees create partitions recursively.
At each step, the recursive algorithm picks the predictor \( j \) and split point \( s \) that minimize:
\[ \sum_{i:\, x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2 \]
Where \( R_1 \) and \( R_2 \) result from splitting predictor \( j \) at \( s \):
\[ R_1(j,s) = \{X|X_j < s\} \mbox{ and } R_2(j,s) = \{X|X_j \geq s\} \]
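This split search can be sketched in a few lines of base R; the simulated data and the grid of candidate cutpoints below are assumptions for illustration only:

```r
set.seed(1)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.2)

# RSS of the two partitions induced by cutting predictor x at s
split_rss <- function(s) {
  left <- y[x < s]
  right <- y[x >= s]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

# candidate cutpoints: midpoints between consecutive observed values
cuts <- sort(x)
cuts <- (head(cuts, -1) + tail(cuts, -1)) / 2
best_s <- cuts[which.min(sapply(cuts, split_rss))]
```

The recursion then repeats the same search inside each of the two resulting partitions.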
When do we stop partitioning? Typically, we stop when a new partition would contain fewer than some minimum number of observations, or when splitting no longer improves the fit by more than some threshold.
Even then, trees built with this stopping criterion tend to overfit training data.
To avoid this, a post-processing step called pruning is used to make the tree smaller.
library(tree)  # fitting and control functions for classification and regression trees

tree_control <- tree.control(
  nobs = nrow(polls_2008),
  mincut = 1,      # minimum number of observations in a child node
  minsize = 2,     # minimum node size required to attempt a split
  mindev = 0.001)  # minimum (scaled) deviance improvement to split
fit <- tree(Y ~ X, data = polls_2008,
            control = tree_control)
cv_polls <- cv.tree(fit)
Score functions used to choose predictors and split values for classification trees are the Gini index
\[ \sum_{k=1}^K \hat{p}_{mk}(1 - \hat{p}_{mk}) \]
and entropy
\[ -\sum_{k=1}^K \hat{p}_{mk} \log \hat{p}_{mk} \]
where \( \hat{p}_{mk} \) is the proportion of training observations in partition \( m \) labeled as class \( k \). Both of these seek to partition observations into subsets that have mostly the same labels.
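As a quick illustration with made-up class proportions, both measures are smallest for pure partitions (these helper functions are hypothetical, not from any package):

```r
gini <- function(p) sum(p * (1 - p))
entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))

gini(c(0.5, 0.5))     # impure two-class partition: 0.5
gini(c(1, 0))         # pure partition: 0
entropy(c(0.5, 0.5))  # log(2), about 0.693
entropy(c(1, 0))      # pure partition: 0
```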
pruned_fit <- prune.tree(fit)
plot(pruned_fit)
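One way to use the cross-validation results to choose the pruned size is to take the size with smallest CV deviance. A self-contained sketch with simulated data, since polls_2008 is not reproduced here:

```r
library(tree)

set.seed(1)
dat <- data.frame(X = seq(0, 1, length.out = 200))
dat$Y <- sin(2 * pi * dat$X) + rnorm(200, sd = 0.3)

# deliberately overgrown tree (small mindev)
fit <- tree(Y ~ X, data = dat,
            control = tree.control(nobs = 200, mindev = 0.001))

cv_fit <- cv.tree(fit)                                 # CV deviance per tree size
best_size <- max(2, cv_fit$size[which.min(cv_fit$dev)])
pruned_fit <- prune.tree(fit, best = best_size)        # prune to the chosen size
```

Without a `best` (or `k`) argument, `prune.tree` returns the whole cost-complexity sequence rather than a single pruned tree.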
Advantages
- Trees are highly interpretable and easy to visualize.
- They model human decision processes well and handle categorical predictors without dummy variables.
Disadvantages
- The greedy recursive-partitioning approach is unstable: small changes in the training data can produce very different trees.
- A single tree is rarely the best-performing method, which motivates averaging many trees.
We use an income distribution as an example:
The population median is
m <- median(income)
m
[1] 45369.52
set.seed(1)
N <- 250
X <- sample(income, N)
M <- median(X)
M
[1] 44334.42
A Monte Carlo simulation of the sampling distribution of the median:
B <- 10^5
Ms <- replicate(B, {
  X <- sample(income, N)
  M <- median(X)
})
The idea behind the bootstrap is to use the sample as an estimate of the population: we sample from the sample.
B <- 10^5
M_stars <- replicate(B, {
  X_star <- sample(X, N, replace = TRUE)
  M_star <- median(X_star)
})
qqplot(Ms, M_stars)
abline(0,1)
True CI: 39765.15 51721.77
Bootstrap estimate: 39549.38 50278.73
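Both intervals come from the 2.5% and 97.5% quantiles of the simulated medians. A self-contained sketch, with a simulated stand-in for the income data (which is not reproduced here, so the numbers will differ):

```r
set.seed(1)
# stand-in population; the real income data is not reproduced here
income <- rlnorm(10^5, meanlog = 10.7, sdlog = 0.5)

N <- 250
B <- 10^4
Ms <- replicate(B, median(sample(income, N)))  # Monte Carlo over samples

X <- sample(income, N)                         # one observed sample
M_stars <- replicate(B, median(sample(X, N, replace = TRUE)))  # bootstrap

quantile(Ms, c(0.025, 0.975))       # "true" interval from Monte Carlo
quantile(M_stars, c(0.025, 0.975))  # bootstrap estimate from one sample
```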
This is much better than what we get if we mindlessly use the CLT:
[1] 27583.05 61085.80
If we know the distribution of our statistic is approximately normal, we can use the bootstrap to estimate its standard error and construct a confidence interval:
mean(Ms) + 1.96*sd(Ms)*c(-1,1)
[1] 38346.74 52635.97
mean(M_stars) + 1.96*sd(M_stars)*c(-1,1)
[1] 37741.55 51923.76
General idea: improve prediction performance and reduce instability by averaging multiple decision trees, a forest of trees constructed with randomness.
The first idea is Bagging (bootstrap aggregation)
To create \( T_j, \, j=1,\ldots,B \), from a training set of size \( N \):
1. Create a bootstrap training set by sampling \( N \) observations from the training set with replacement.
2. Build a decision tree from the bootstrap training set.
For a new observation, average the predictions of the \( B \) trees (regression) or take a majority vote (classification).
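The bagging steps above can be sketched with rpart and simulated data (both are assumptions for illustration; any tree fitter would do):

```r
library(rpart)

set.seed(1)
n <- 200
dat <- data.frame(x = runif(n))
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.3)

B <- 50
trees <- lapply(1:B, function(j) {
  boot <- dat[sample(n, n, replace = TRUE), ]  # bootstrap training set
  rpart(y ~ x, data = boot)                    # tree from the bootstrap set
})

# bagged prediction: average the B trees
newdat <- data.frame(x = seq(0, 1, length.out = 100))
bagged <- rowMeans(sapply(trees, predict, newdata = newdat))
```

Averaging smooths out the piecewise-constant jumps of the individual trees.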
The second random forest feature is to use a random selection of features to split on when deciding partitions.
When building each tree \( T_j \), at each recursive partition only consider a randomly selected subset of predictors to check for best split.
This reduces correlation between trees in forest, improving prediction accuracy.
library(randomForest)

fit <- randomForest(
  as.factor(label) ~ X_1 + X_2,
  nodesize = 250,  # minimum size of terminal nodes
  data = digits_train)
CART: 0.8055037
First RF: 0.801306
Second RF: 0.8264925
pred <- predict(fit, newdata = test_set, type = "response")
confusionMatrix(table(pred = pred, true = test_set$label))$overall[1]
Accuracy
0.9659199
| feature | mean decrease in Gini |
|---|---|
| pixel378 | 315.1068 |
| pixel437 | 308.2002 |
| pixel434 | 291.9889 |
| pixel405 | 284.2842 |
| pixel290 | 252.8761 |
| pixel409 | 247.7304 |
| pixel350 | 245.8441 |
| pixel377 | 228.2036 |
| pixel515 | 227.6764 |
| pixel542 | 226.3225 |
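A table like the one above comes from the forest's variable-importance measure. A self-contained sketch using iris as a stand-in for the digits data (which is not reproduced here):

```r
library(randomForest)

set.seed(1)
fit_rf <- randomForest(Species ~ ., data = iris)

# importance() returns the mean decrease in Gini per predictor;
# sort to see the most informative features first
imp <- importance(fit_rf)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
```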